Predicting Data Science Salaries

Anh Nguyen, Amira Bendjama, Hong Doan

  1. Introduction and Problem Statement

    The field of data science has experienced remarkable growth in recent years, with organizations across diverse industries recognizing the value of data-driven decision making. According to an article by 365 Data Science, the US Bureau of Labor Statistics projects that employment of data scientists will grow by 36% from 2021 to 2031. This rate is significantly higher than the 5% average growth rate for all occupations, indicating substantial demand for data science talent. The surging demand presents both opportunities and challenges for job seekers, particularly recent graduates. One of the significant hurdles they face is the lack of salary transparency in the data science job market. This opacity creates uncertainty regarding compensation and hinders job seekers’ ability to negotiate fair salaries.

    Data science salaries vary significantly across industries and locations. For instance, according to Zippia, data scientists working in the finance and technology sectors tend to earn higher salaries than those in other industries. Geographical location also plays a crucial role in determining salaries. Large cities with a higher concentration of tech companies and higher living costs, such as San Francisco and New York, offer higher salaries than smaller cities.

    The discrepancies in data science salaries can also be attributed to various factors, including job responsibilities, experience level, educational background, and specific skill sets. A study conducted by Burtch Works, a leading executive recruiting firm, found that data scientists with advanced degrees, such as a Ph.D., tend to command higher salaries than those with bachelor’s or master’s degrees. Similarly, professionals with expertise in specialized areas, such as machine learning or natural language processing, often earn higher salaries due to the high demand for these skills.

    According to a Visier report that surveyed 1,000 US-based full-time employees, 79% of respondents want some form of pay transparency and 32% want total transparency, in which all employee salaries are publicized. However, the 2022 Pay Clarity Survey by WTW found that only 17% of companies disclose pay range information in U.S. locations where state or local laws do not require it. In states that have pay transparency laws, such as Colorado and New York, job postings have declined since the laws went into effect. Some employers comply with the new laws by widening the posted salary ranges, sometimes to the point of being uninformative. These statistics highlight the lack of pay transparency not only in the field of data science, but across multiple job markets. Job seekers often struggle to estimate salaries for data science positions due to the scarcity of reliable information.

    To address this problem, our project aims to develop a predictive model that estimates the salary for data science jobs. By leveraging publicly available data and employing machine learning algorithms, we seek to provide job seekers a better understanding of salary expectations within the data science job market and empower them to negotiate fair and competitive compensation packages.

  2. Data Sources and Data Preparation

#install.packages("rpart.plot")
#install.packages("ggplot2")
#install.packages("e1071")
# Install the plotly package
#install.packages("plotly")
library(ggplot2)
ds_salaries <- read.csv("ds_salaries.csv")
summary(ds_salaries)
##        X           work_year    experience_level   employment_type   
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607        
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character  
##  Median :303.0   Median :2022   Mode  :character   Mode  :character  
##  Mean   :303.0   Mean   :2021                                        
##  3rd Qu.:454.5   3rd Qu.:2022                                        
##  Max.   :606.0   Max.   :2022                                        
##   job_title             salary         salary_currency    salary_in_usd   
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859  
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726  
##  Mode  :character   Median :  115000   Mode  :character   Median :101570  
##                     Mean   :  324000                      Mean   :112298  
##                     3rd Qu.:  165000                      3rd Qu.:150000  
##                     Max.   :30400000                      Max.   :600000  
##  employee_residence  remote_ratio    company_location   company_size      
##  Length:607         Min.   :  0.00   Length:607         Length:607        
##  Class :character   1st Qu.: 50.00   Class :character   Class :character  
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character  
##                     Mean   : 70.92                                        
##                     3rd Qu.:100.00                                        
##                     Max.   :100.00
head(ds_salaries,5)

This dataset has 607 rows and 12 columns

We want to compare all salaries in a single currency, so we keep the “salary_in_usd” column and use subset() to drop the “salary_currency” and “salary” columns, along with the row index column “X”.

ds_salaries <- subset(ds_salaries, select = -c(X , salary_currency, salary))
head(ds_salaries, 5)
num_null_rows <- sum(rowSums(is.na(ds_salaries)) == ncol(ds_salaries))
print(num_null_rows)
## [1] 0

No row consists entirely of missing values, and the summary above reports no NA counts for the numeric columns.
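The check above only flags rows in which every column is NA. A stricter check counts missing values per column and rows with at least one missing value. The following is a minimal sketch on a small toy data frame (the values are made up for illustration and are not taken from ds_salaries.csv):

```r
# Toy data frame with hypothetical values, used only to illustrate the checks
toy <- data.frame(
  job_title     = c("Data Scientist", NA, "Data Engineer"),
  salary_in_usd = c(100000, 85000, NA),
  stringsAsFactors = FALSE
)

# Rows where every column is NA (the check used in the report)
all_na_rows <- sum(rowSums(is.na(toy)) == ncol(toy))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows with at least one missing value
incomplete_rows <- sum(!complete.cases(toy))

print(all_na_rows)
print(na_per_column)
print(incomplete_rows)
```

On this toy input the first check reports no problem even though two of the three rows are incomplete, which is why the per-column count is the safer one to run on the real data.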

repeated_entries <- subset(ds_salaries, duplicated(ds_salaries))
print(repeated_entries)
##     work_year experience_level employment_type                 job_title
## 218      2021               MI              FT            Data Scientist
## 257      2021               MI              FT             Data Engineer
## 332      2022               SE              FT              Data Analyst
## 333      2022               SE              FT              Data Analyst
## 334      2022               SE              FT              Data Analyst
## 354      2022               SE              FT            Data Scientist
## 363      2022               SE              FT              Data Analyst
## 364      2022               SE              FT              Data Analyst
## 371      2022               SE              FT            Data Scientist
## 375      2022               MI              FT             ETL Developer
## 378      2022               SE              FT             Data Engineer
## 386      2022               SE              FT             Data Engineer
## 393      2022               SE              FT              Data Analyst
## 394      2022               SE              FT              Data Analyst
## 407      2022               MI              FT              Data Analyst
## 439      2022               SE              FT Machine Learning Engineer
## 440      2022               SE              FT Machine Learning Engineer
## 444      2022               MI              FT             Data Engineer
## 447      2022               SE              FT             Data Engineer
## 448      2022               SE              FT             Data Engineer
## 474      2022               SE              FT            Data Scientist
## 528      2022               SE              FT              Data Analyst
## 530      2022               SE              FT              Data Analyst
## 537      2022               SE              FT              Data Analyst
## 538      2022               SE              FT             Data Engineer
## 546      2022               SE              FT             Data Engineer
## 548      2022               SE              FT             Data Engineer
## 552      2022               SE              FT            Data Scientist
## 556      2022               SE              FT             Data Engineer
## 567      2022               SE              FT              Data Analyst
## 570      2022               SE              FT            Data Scientist
## 572      2022               SE              FT            Data Scientist
## 573      2022               SE              FT              Data Analyst
## 575      2022               SE              FT            Data Scientist
## 576      2022               SE              FT            Data Scientist
## 577      2022               SE              FT            Data Scientist
## 579      2022               SE              FT             Data Engineer
## 588      2022               SE              FT            Data Scientist
## 589      2022               SE              FT              Data Analyst
## 593      2022               SE              FT            Data Scientist
## 597      2022               SE              FT            Data Scientist
## 598      2022               SE              FT              Data Analyst
##     salary_in_usd employee_residence remote_ratio company_location company_size
## 218         90734                 DE           50               DE            L
## 257        200000                 US          100               US            L
## 332         90320                 US          100               US            M
## 333        112900                 US          100               US            M
## 334         90320                 US          100               US            M
## 354        123000                 US          100               US            M
## 363        130000                 CA          100               CA            M
## 364         61300                 CA          100               CA            M
## 371        123000                 US          100               US            M
## 375         54957                 GR            0               GR            M
## 378        165400                 US          100               US            M
## 386        132320                 US          100               US            M
## 393        112900                 US          100               US            M
## 394         90320                 US          100               US            M
## 407         58000                 US            0               US            S
## 439        189650                 US            0               US            M
## 440        164996                 US            0               US            M
## 444         78526                 GB          100               GB            M
## 447        209100                 US          100               US            L
## 448        154600                 US          100               US            L
## 474        140000                 US          100               US            M
## 528        135000                 US          100               US            M
## 530         90320                 US          100               US            M
## 537        112900                 US          100               US            M
## 538        155000                 US          100               US            M
## 546        115000                 US          100               US            M
## 548        130000                 US          100               US            M
## 552        140400                 US            0               US            L
## 556        160000                 US          100               US            M
## 567        170000                 US          100               US            M
## 570        140000                 US          100               US            M
## 572        140000                 US          100               US            M
## 573        100000                 US          100               US            M
## 575        210000                 US          100               US            M
## 576        140000                 US          100               US            M
## 577        210000                 US          100               US            M
## 579        100000                 US          100               US            M
## 588        140000                 US          100               US            M
## 589         99000                 US            0               US            M
## 593        230000                 US          100               US            M
## 597        210000                 US          100               US            M
## 598        170000                 US          100               US            M

There are 42 duplicate rows

# Remove duplicate rows
df <- ds_salaries[!duplicated(ds_salaries), ]
# check again
repeated_entries_new <- subset(df, duplicated(df))
print(repeated_entries_new)
## [1] work_year          experience_level   employment_type    job_title         
## [5] salary_in_usd      employee_residence remote_ratio       company_location  
## [9] company_size      
## <0 rows> (or 0-length row.names)
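Note that duplicated() marks only the second and later occurrences of a row, so the first copy of each repeated entry is kept. A minimal sketch of this behavior, assuming a toy data frame with made-up values:

```r
# Toy data frame with hypothetical values: three identical rows and one distinct row
toy <- data.frame(
  job_title     = c("Data Analyst", "Data Analyst", "Data Analyst", "Data Engineer"),
  salary_in_usd = c(90320, 90320, 90320, 130000)
)

# TRUE only for repeats of an earlier row; the first occurrence stays FALSE
dup_flags <- duplicated(toy)

# Number of rows that would be removed
n_duplicates <- sum(dup_flags)

# Dropping the flagged rows keeps one copy of each distinct row
deduped <- toy[!dup_flags, ]

print(dup_flags)
print(n_duplicates)
print(nrow(deduped))
```

Applied to the counts reported above, removing 42 flagged rows from 607 leaves 565 rows in df. Identical rows in salary survey data may be genuine separate responses, so dropping them is a modeling choice, not a requirement.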

Salary groups

We add a new column that splits the salaries into three groups: Low, Medium, and High. The split is based on percentiles: salaries below the 25th percentile are classified as “Low”, salaries between the 25th and 75th percentiles (inclusive) as “Medium”, and salaries above the 75th percentile as “High”.

# adding new column 
# Calculate the percentiles
percentiles <- quantile(df$salary_in_usd, probs = c(0.25, 0.75))

# Define the thresholds
low_threshold <- percentiles[1]  # 25th percentile
high_threshold <- percentiles[2]  # 75th percentile

# Create a new column based on percentiles
df$salary_classification <- ifelse(df$salary_in_usd < low_threshold, "Low",
                                   ifelse(df$salary_in_usd > high_threshold, "High", "Medium"))
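Because the comparisons are strict (< and >), a salary that equals either threshold falls into “Medium”. The following sketch shows this on five made-up salaries (not taken from the dataset) chosen so that the percentiles land exactly on data points:

```r
# Hypothetical salaries, chosen so the 25th and 75th percentiles are exact data points
salaries <- c(40000, 60000, 80000, 100000, 120000)

# Same thresholds as in the report: 25th and 75th percentiles
thresholds <- quantile(salaries, probs = c(0.25, 0.75))
low_threshold  <- thresholds[[1]]
high_threshold <- thresholds[[2]]

# Same nested ifelse() as in the report
classification <- ifelse(salaries < low_threshold, "Low",
                         ifelse(salaries > high_threshold, "High", "Medium"))

print(data.frame(salaries, classification))
```

Here the thresholds are 60000 and 100000, so only 40000 is “Low” and only 120000 is “High”; the two boundary salaries are labeled “Medium”.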
  3. Data Exploration and Visualization

Top 10 Jobs in the dataset:

# Get top 10 job titles and their value counts
top10_job_title <- head(sort(table(df$job_title), decreasing = TRUE), 10)

top10_job_title_df <- data.frame(job_title = names(top10_job_title), count = as.numeric(top10_job_title))
top10_job_title_df
# Load the required packages
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Define custom color palette
custom_colors <- c("#FF6361", "#FFA600", "#FFD700", "#FF76BC", "#69D2E7", "#6A0572", "#FF34B3", "#118AB2", "#FFFF99", "#FFC1CC")

# Create bar plot
fig <- plot_ly(data = top10_job_title_df, x = ~reorder(job_title, -count), y = ~count, type = "bar",
               marker = list(color = custom_colors), text = ~count) %>%
  layout(title = "Top 10 Job Titles", xaxis = list(title = "Job Titles"), yaxis = list(title = "Count"),
         font = list(size = 17), template = "plotly_dark")

# Adjust layout settings to avoid label overlap
fig <- fig %>% layout(
  margin = list(b = 150),  # Increase bottom margin to provide space for labels
  xaxis = list(
    tickangle = 45,  # Rotate x-axis tick labels
    automargin = TRUE  # Automatically adjust margins to avoid overlap
  )
)

# Display the plot
fig

Experience level categories:

Our dataset has 4 different experience categories:

    • EN: Entry-level / Junior

    • MI: Mid-level / Intermediate

    • SE: Senior-level / Expert

    • EX: Executive-level / Director

# Create a mapping of category abbreviations to full names
category_names_experience <- c("EN" = "Entry-level",
                    "MI" = "Mid-level",
                    "SE" = "Senior-level",
                    "EX" = "Executive-level")

# Get the sorted experience data
experience <- head(sort(table(df$experience_level), decreasing = TRUE))

# Replace the category names with full forms
names(experience) <- category_names_experience[names(experience)]

# Calculate the percentage for each category
percentages <- round(100 * experience / sum(experience), 2)

# Define a custom color palette
custom_colors <- c("#FFA998", "#FF76BC", "#69D2E7", "#FFA600")

# Create a pie chart, starting at 12 o'clock and drawn clockwise
pie(experience, labels = paste0(names(experience), " (", percentages, "%)"), col = custom_colors, border = "white", clockwise = TRUE, init.angle = 90)

# Add a legend that uses the same colors as the slices
legend("topright", legend = names(experience), fill = custom_colors, border = "white", cex = 0.8)

# Add a title in a plain (non-bold) font
title("Experience Distribution", font.main = 1)

### Company size distribution

# Create a mapping of category abbreviations to full names
category_names_company <- c("M" = "Medium",
                    "L" = "Large",
                    "S" = "Small"
                   )


# Get the sorted company size data
company_size <- head(sort(table(df$company_size), decreasing = TRUE))

# Replace the category names with full forms
names(company_size) <- category_names_company[names(company_size)]

# Set the maximum value for the y-axis
max_count <- max(company_size)

# Create a bar plot with adjusted y-axis limits
barplot(company_size, col = custom_colors, main = "Company Size Distribution", xlab = "Company Size", ylab = "Count", ylim = c(0, max_count + 10))

### Salaries Distribution

# Set the scipen option to a high value
options(scipen = 10)

# Create boxplot of salaries
bp <- boxplot(df$salary_in_usd / 1000, 
        col = "skyblue", 
        main = "Boxplot of Salaries",
        ylab = "Salary in Thousands USD",
        notch = TRUE)

### Salary Classification Distribution

# Get the sorted salary classification data
salary_classification <- sort(table(df$salary_classification), decreasing = TRUE)

salary_classification_df <- data.frame(salary_classification = names(salary_classification), count = as.numeric(salary_classification))

fig <- plot_ly(
  data = salary_classification_df,
  x = ~reorder(salary_classification, -count),
  y = ~count,
  type = "bar",
  marker = list(color = custom_colors),
  text = ~count,
  width = 700,
  height = 400
)

fig <- fig %>% layout(
  title = "Salary Classification Distribution",
  xaxis = list(title = "Salary Classification"),
  yaxis = list(title = "Count"),
  font = list(size = 17),
  template = "ggplot2"
)

fig
# Create a data frame with counts of experience levels by salary classification
experience_salary <- table(df$experience_level, df$salary_classification)

# Define custom colors for each experience level
custom_colors <- c("#69D2E7", "#FFA600", "#FF6361", "#FFD700")

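# Create a data frame for the plot.
# as.vector() flattens the table column by column, so the experience levels repeat
# as a block while each salary classification is repeated once per experience level.
plot_data <- data.frame(Experience = rep(rownames(experience_salary), times = ncol(experience_salary)),
                        Salary_Classification = rep(colnames(experience_salary), each = nrow(experience_salary)),
                        Count = as.vector(experience_salary))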

# Convert Count column to numeric
plot_data$Count <- as.numeric(plot_data$Count)

# Create the bar plot
library(plotly)
fig <- plot_ly(data = plot_data, x = ~Salary_Classification, y = ~Count, 
               color = ~Experience, colors = custom_colors, type = "bar") %>%
  layout(title = "Experience Level by Salary Classification",
         xaxis = list(title = "Salary Classification"),
         yaxis = list(title = "Count"),
         font = list(size = 17),
         template = "plotly_dark")

fig
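A grouped bar chart like this one needs the contingency table in long format, with one row per combination of experience level and salary classification. One way to get that layout with the labels aligned to the counts is to call as.data.frame() on the table. A minimal sketch, assuming a small toy data frame with made-up values:

```r
# Toy data with hypothetical values, used only to show the reshaping step
toy <- data.frame(
  experience_level      = c("EN", "EN", "SE", "SE", "SE", "MI"),
  salary_classification = c("Low", "Medium", "High", "High", "Medium", "Medium")
)

# Contingency table: experience levels in rows, salary classes in columns
counts <- table(toy$experience_level, toy$salary_classification)

# Long format: one row per (experience, classification) pair, including zero counts
long <- as.data.frame(counts, stringsAsFactors = FALSE)
names(long) <- c("Experience", "Salary_Classification", "Count")
print(long)

# Spot check: the long-format count matches the corresponding table cell
se_high <- long$Count[long$Experience == "SE" & long$Salary_Classification == "High"]
print(se_high == counts["SE", "High"])
```

A spot check like the last line is a cheap way to confirm that labels and counts have not been shifted against each other before plotting.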
  4. Modeling

    a. Logistic Regression

    b. Random Forest

    c. Decision Tree

  5. Evaluation and Results

    a. Logistic Regression

    b. Random Forest

    c. Decision Tree

  6. Major Challenges and Solutions

    • Data is not up to date (the dataset only covers work years 2020 to 2022)

    • Data is imbalanced
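The report does not say which variable is imbalanced. As a hedged illustration, the sketch below measures imbalance in a categorical column through class shares and the ratio between the largest and smallest class. The vector of experience levels is made up and only stands in for a column such as df$experience_level:

```r
# Hypothetical experience levels; stands in for a column such as df$experience_level
toy_levels <- c("SE", "SE", "SE", "SE", "SE", "MI", "MI", "MI", "EN", "EX")

# Count of each class
counts <- table(toy_levels)

# Share of each class in percent
shares <- round(100 * prop.table(counts), 1)

# Ratio between the most and least frequent class (1 means perfectly balanced)
imbalance_ratio <- max(counts) / min(counts)

print(shares)
print(imbalance_ratio)
```

By construction, the percentile-based salary_classification column puts roughly a quarter of the rows in “Low”, half in “Medium”, and a quarter in “High”, so the same check can be run on that column to see how far the target itself is from balanced.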

  7. Conclusion and Future Work

  8. References

    The Data Scientist Job Outlook in 2023 | 365 Data Science

    Which Industry Pays the Highest Data Scientist Salary? How To Make The Most Money As A Data Scientist - Zippia

    Burtch-Works-Study_DS-PAP-2019.pdf (burtchworks.com)

    New Visier Report Reveals 79% of Employees Want Pay Transparency (prnewswire.com)

    More NA organizations plan to disclose pay information - WTW (wtwco.com)

    Study: Pay Transparency Reduces Recruiting Costs (shrm.org)